Take-home_Ex03
Setting the Scene
Housing is an essential component of household wealth worldwide. Buying a house has always been a major investment for most people. The price of housing is affected by many factors. Some are global in nature, such as the general economy of a country or the inflation rate. Others are specific to the properties themselves. These factors can be further divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the neighbourhood of the property, such as proximity to childcare centres, public transport services, and shopping centres.
Conventionally, housing resale price predictive models were built using the Ordinary Least Squares (OLS) method. However, this method fails to take into consideration the spatial autocorrelation and spatial heterogeneity that exist in geographic data sets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models can lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographically Weighted models were introduced for calibrating predictive models of housing resale prices.
Objectives
Predict the HDB resale prices (4-room) for the months of January and February 2023, using data from 1st January 2021 to 31st December 2022 as the training dataset.
Compare the results of the prediction by the conventional OLS method and the Geographically Weighted method.
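The train/test split described above can be sketched with base R's subset(). This is a minimal sketch: the `resale` data frame below is a toy stand-in with only the columns needed for the split, and assumes a character `month` column in "YYYY-MM" format as in the Data.gov.sg file.

```r
# Toy stand-in for the raw resale-flat-prices data frame
resale <- data.frame(
  month     = c("2020-12", "2021-06", "2022-12", "2023-01", "2023-02"),
  flat_type = c("4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "3 ROOM"),
  stringsAsFactors = FALSE
)

# Training window: Jan 2021 - Dec 2022; test window: Jan - Feb 2023.
# Lexicographic comparison works because month is in "YYYY-MM" format.
train <- subset(resale, flat_type == "4 ROOM" &
                  month >= "2021-01" & month <= "2022-12")
test  <- subset(resale, flat_type == "4 ROOM" &
                  month >= "2023-01" & month <= "2023-02")
```

The same windows are applied with dplyr's filter() on the real dataset in the sections that follow.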
1. Setup
Referencing Senior’s work
1.1 Datasets
Aspatial Data
- HDB resale prices in Singapore from January 2021 to February 2023. It is in csv format and can be downloaded from Data.gov.sg.
Geospatial Data
- 2019 Master Plan Planning Subzone
Locational factors with geographic coordinates
Childcare data in geojson format.
Eldercare data in shapefile format.
Hawker centre data in geojson format.
Parks data in geojson format.
MRT stations data in shapefile format.
Supermarkets data in geojson format.
Locational factors without geographic coordinates
CBD Coordinates to be scraped and obtained from Google
Good primary schools scraped from Local Salary Forum
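For the CBD, a single representative point is enough to compute proximity. A minimal sketch of how the scraped coordinates can be stored; the latitude/longitude values below are illustrative assumptions to be replaced with the values actually obtained from Google.

```r
# Illustrative CBD point (assumed values -- replace with the scraped ones)
cbd <- data.frame(
  name      = "CBD",
  latitude  = 1.2874,
  longitude = 103.8520
)
```

This data frame can later be converted to an sf object and projected to SVY21 alongside the other locational factors.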
1.2 Install and Load Packages
pacman::p_load(dplyr, rvest, olsrr, ggpubr, sf, spdep, GWmodel, tmap, gtsummary, readr, corrplot, tidyverse, httr, jsonlite, broom)
1.3 Loading data
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
1.3.1 Filtering Data
We will filter our data to focus on:
4-Room HDB flat sub-market.
prices from 1st January 2021 to 31st December 2022
resale_focused <- filter(resale, flat_type == "4 ROOM") %>%
  filter(month >= "2021-01" & month <= "2022-12")
2. Data Wrangling
2.1 Transforming resale data
We require the area of the unit, floor level, remaining lease (months), and age of the unit.
The remaining lease is currently given in years and months. We will have to standardise it to months so that calculations are easier.
Our data also does not have coordinates. We will hence have to concatenate the block number and street name, then retrieve the coordinates from OneMap.
To get the age of the unit, we subtract lease_commence_date from the transaction month.
rs_transform <- resale_focused %>%
  mutate(address = paste(block, street_name)) %>%
  mutate(remaining_lease_yr = as.integer(str_sub(remaining_lease, 1, 2))) %>%
  mutate(remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11))) %>%
  mutate(age_of_unit = as.integer(str_sub(month, 1, 4)) - as.integer(lease_commence_date))
Now, we have to:
Convert the number of years to months by multiplying by 12, for both age_of_unit and remaining_lease_yr.
Convert all the NAs in remaining_lease_mth into 0s so that we can add the two columns together.
Store the result in a new column called remaining_lease_mths.
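As a worked check of the conversion, assuming a remaining_lease value in the "NN years MM months" format used by the dataset:

```r
# "61 years 04 months": characters 1-2 hold the years,
# characters 9-11 hold the months (with a leading space).
rl  <- "61 years 04 months"
yr  <- as.integer(substr(rl, 1, 2))    # 61
mth <- as.integer(substr(rl, 9, 11))   # 4
total_mths <- yr * 12 + mth            # 736
```

Entries such as "61 years" with no months component yield NA for the month slice, which is why the NAs are converted to 0 before the columns are summed.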
rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform$age_of_unit <- rs_transform$age_of_unit * 12
rs_transform <- rs_transform %>%
  mutate(remaining_lease_mths = remaining_lease_yr + remaining_lease_mth) %>%
  select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, age_of_unit,
         lease_commence_date, remaining_lease_mths, resale_price)
2.2 Retrieve Postal Codes and Coordinates of Addresses
2.2.1 Create a list storing unique addresses
We create a list of unique addresses to ensure that we do not run more GET requests than necessary.
We also sort it to make it easier to see at which address a GET request fails.
Here, we use the unique() function of base R to extract the unique addresses, then the sort() function of base R to sort the resulting vector.
address_list <- sort(unique(rs_transform$address))
2.2.2 Creating a function to retrieve coordinates from OneMap
1. Firstly, we create a dataframe called `postal_coords` to store all the final retrieved coordinates.
2. Secondly, we use the *GET()* function of the httr package to make a GET request to [*https://developers.onemap.sg/commonapi/search*](https://developers.onemap.sg/commonapi/search)
- OneMap SG offers functions for us to query spatial data from the API in a tidy format and provides additional functionality for easy data manipulation.
- Here, we will be using their REST APIs to search address data for a given search value and retrieve the coordinates of the searched location.
- The required variables to be included in the GET request are as follows:
- **`searchVal`**: Keywords entered by user that is used to filter out the results.
- **`returnGeom`** {Y/N}: Checks if user wants to return the geometry.
- **`getAddrDetails`** {Y/N}: Checks if user wants to return address details for a point.
- **Note**:
- The JSON response returned will contain multiple fields.
- However, we are only interested in the postal code and coordinates like Latitude & Longitude.
- On their website, they also announced a minor text fix changing the field name "LONGTITUDE" to "LONGITUDE"; we will use the latter in this analysis.
3. We then create a dataframe `new_row` which will be used to store each final set of coordinates retrieved during the loop
4. We also need to check the number of responses returned and append to the main dataframe accordingly. This is because:
- The number of responses returned for a searched location (indicated by the variable `found`) varies: some locations return a single result while others return multiple results.
- For example, the address 2 JLN BATU returns 3 sets of postal codes and coordinates (meaning `found` = 3).
- Hence, we first look at only those results that do not have empty postal codes, then take the first set/row of coordinates.
- We can also check whether an address is invalid by looking at the number of rows returned by the request.
- Some searched addresses will be invalid (meaning `found` = 0).
- This step was helpful in determining what was causing errors in the API calls. We will see in a later section which errors were caused by invalid search values.
5. Lastly, we will append the returned response (`new_row`) with the necessary fields to the main dataframe (`postal_coords`) using *rbind()* function of base R package.
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://developers.onemap.sg/commonapi/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
}
else {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}
2.2.3 Retrieve resale coordinates
coords <- get_coords(address_list)
Check if there are any NA/NIL values
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal == "NIL"), ]
Searching for 215 Choa Chu Kang Central on OneMap returns "Blk 215 and 216 Choa Chu Kang Central" with two different postal codes. This could be why a single postal code was not assigned. However, since we have the coordinates, we can proceed.
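Since only the coordinates are needed downstream, rows with a missing postal code but valid latitude/longitude can be kept. A minimal sketch of this filter; the toy `coords` data frame and its values below are assumptions for illustration only.

```r
# Toy stand-in for the retrieved coordinates table (illustrative values)
coords <- data.frame(
  address   = c("ADDRESS A", "ADDRESS B"),
  postal    = c(NA, "123456"),
  latitude  = c("1.3840", "1.3208"),
  longitude = c("103.7470", "103.9330"),
  stringsAsFactors = FALSE
)

# Keep any row whose coordinates were resolved, even without a postal code
coords_ok <- coords[!(is.na(coords$latitude) | is.na(coords$longitude)), ]
```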
2.2.4 Combine resale and coordinates data
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))
head(rs_coords)
rs_coords_final <- rs_coords %>% select(c("latitude", "longitude"))
Write file to rds
rs_coords_rds <- write_rds(rs_coords, "data/aspatial/rds/rs_coords.rds")
2.3 Transform CRS and check
Read RDS
rs_coords <- read_rds("data/aspatial/rds/rs_coords.rds")
Since the coordinates are in decimal degrees, the geographic CRS is WGS84.
We will hence need to assign it as EPSG 4326 before transforming it to EPSG 3414, the code for the SVY21 projected CRS.
We will:
- convert the data frame into an sf object
- transform the coordinates of the sf object
rs_coords_sf <- st_as_sf(rs_coords,
coords = c("longitude",
"latitude"),
crs = 4326) %>%
st_transform(crs = 3414)
Check EPSG
st_crs(rs_coords_sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
2.3.1 Check for invalid geometries
length(which(st_is_valid(rs_coords_sf) == FALSE))
[1] 0
No invalid geometries are found, so we can proceed to visualise the points.
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="blue", size = 0.02)